Introduction to the data

I found three government workforce data sets on different public websites and downloaded them in *.csv format. You can view these data sets for yourself in my GitHub repo. Unfortunately, they are not relational databases, so joins do not make much sense; however, they do contain similar data types and columns, so unions will be useful. I really enjoy using RStudio and Markdown, and I will be using dplyr from the tidyverse to do most of my analytics, but where possible, I will also provide the equivalent SQL code.

I tried to keep a consistent naming convention across all of the data sets: lower case, with underscores instead of spaces. Each data set contains a department variable and some sort of salary or wage value, which will make for a good comparison across the cities. The Marin data contains the most descriptive columns, allowing for more interesting analysis. Louisville is the least like the others: it reports summarized values by department rather than by individual employee. Let’s look at the departments:

Duplicate rows for individuals

In two of the data sets, employees occasionally have duplicate rows. I would imagine this could happen due to changes in role, salary, or other status changes. The SQL code to find these individuals would be:

Sample SQL to check for duplicate names

select names, count(*) as num_rows
from bloomington
group by names
having count(*) > 1
Number of Duplicated Names
78

It would make sense to take the maximum value for each person, instead of having the duplicate rows.

Sample SQL to remove duplicate names

select names, departments, max(salaries) as salaries
from bloomington
group by names, departments
Number of Duplicated Names - Bloomington post dedupe
0

After rolling the data up to the max for each individual, there were no more duplicates.
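The actual analysis uses dplyr, but the same max-rollup dedupe can be sketched in pandas; the toy rows below are stand-ins for the real file:

```python
import pandas as pd

# Toy stand-in for the Bloomington data (the real file is in the repo)
bloomington = pd.DataFrame({
    "names": ["A Smith", "A Smith", "B Jones"],
    "departments": ["Parks", "Parks", "Police"],
    "salaries": [40000, 45000, 60000],
})

# Equivalent of: select names, departments, max(salaries) ... group by names, departments
deduped = (bloomington
           .groupby(["names", "departments"], as_index=False)["salaries"]
           .max())

# Confirm no names appear more than once after the rollup
dupes = deduped["names"].value_counts()
print((dupes > 1).sum())  # 0
```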

A similar check was run for Chicago, and there were duplicates there as well.

Number of Duplicated Names Chicago
40

Chicago has 40 individuals with multiple rows sharing the same department, title, employment type, etc. Again, these are probably people who received some sort of salary change. Taking the max salary seems appropriate:

select names
      ,job_titles
      ,departments
      ,wage_types
      ,ft_or_pt
      ,max(hours) as hours
      ,max(salaries) as salaries
      ,max(hourly_rates) as hourly_rates
from chicago
group by names
      ,job_titles
      ,departments
      ,wage_types
      ,ft_or_pt
Number of Duplicated Names Chicago
0

Having resolved the duplicates in Chicago, we can now move on to Marin.

Finally, Marin is a multi-year data set. We should check for duplicates within each year and determine the last date in the date range.

Number of Duplicated Names - Marin
0
Max year in Marin Data
max_year
2020
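Both checks can be sketched in pandas; the toy rows and the `year` column name are assumptions about the Marin file:

```python
import pandas as pd

# Toy stand-in for the multi-year Marin data; columns are assumptions
marin = pd.DataFrame({
    "names": ["C Lee", "C Lee", "D Kim"],
    "year":  [2019, 2020, 2020],
})

# Last year in the date range
max_year = marin["year"].max()
print(max_year)  # 2020

# Duplicates are counted within each year, not across years
dupe_count = (marin.groupby(["names", "year"]).size() > 1).sum()
print(dupe_count)  # 0
```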

Summarizing the data

The data sets are all slightly different, so each will have to be summarized in its own way.

Bloomington

Bloomington is a sparse data set: it consists only of names, departments, and salaries.

Looking at the summary table, there are clearly some part-time positions that skew the data leftwards. Many of these appear to be administrative positions that could be flagged manually by department, but to save time, I am going to flag them by the 20% quantile.

Looking at the graph above, the 20% quantile is near $25,000, so we’ll flag anything below that cutoff as part-time and then rerun our summary tables.
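The quantile flag can be sketched in pandas; the toy salaries and the `ft_or_pt` label column are assumptions, not the real data:

```python
import pandas as pd

# Toy salaries; on the real data the 20% quantile lands near $25,000
bloomington = pd.DataFrame({
    "names": ["A", "B", "C", "D", "E"],
    "salaries": [12000, 18000, 45000, 52000, 61000],
})

# Flag everything at or below the 20% quantile as part-time
cutoff = bloomington["salaries"].quantile(0.20)
bloomington["ft_or_pt"] = (bloomington["salaries"] > cutoff).map(
    {True: "FT", False: "PT"})
```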

Chicago

One of the first things that is noticeable about Chicago’s data is that it has both salaried and hourly workers. I am going to multiply the hours worked by the hourly rate (times 52 weeks) and put the result in the salary column to get it all into one place.
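That conversion can be sketched in pandas; the toy rows and the `"hourly"` value in `wage_types` are assumptions about the file's coding:

```python
import pandas as pd

# Toy Chicago rows: one salaried, one hourly
chicago = pd.DataFrame({
    "names": ["E Park", "F Cruz"],
    "wage_types": ["salary", "hourly"],
    "salaries": [70000.0, None],
    "hours": [None, 40.0],
    "hourly_rates": [None, 25.0],
})

# For hourly workers: weekly hours * rate * 52 weeks -> annualized salary
hourly = chicago["wage_types"] == "hourly"
chicago.loc[hourly, "salaries"] = (
    chicago.loc[hourly, "hours"] * chicago.loc[hourly, "hourly_rates"] * 52
)
```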

Marin

Marin has no salary numbers, and it spans multiple years of data. Let’s examine only one year to make it 1:1 with the other data sets in terms of counts. Looking at the year 2020, we’ll examine each department by gender.
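The filter-and-count step can be sketched in pandas; the toy rows and the `year` and `gender` column names are assumptions about the Marin file:

```python
import pandas as pd

# Toy Marin rows; column names are assumptions
marin = pd.DataFrame({
    "departments": ["Health", "Health", "Fire", "Fire"],
    "gender": ["F", "M", "M", "M"],
    "year": [2020, 2020, 2020, 2019],
})

# Keep only 2020, then cross-tabulate department by gender
counts = (marin[marin["year"] == 2020]
          .groupby(["departments", "gender"])
          .size()
          .unstack(fill_value=0))
```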

Chicago Salary Analysis

Here, we looked at 95% prediction interval ranges for these salaries by department to spot individual salaries outside the normal range. You can hover over a point to see the name of the person it corresponds to.

To create this graph, I found the 95% prediction interval for each department’s salary expectations and then joined that back onto the original data set. The SQL for this join would look something like this:

select chi.*
      ,mdl.*
from chicago as chi
left join model_data as mdl
on chi.departments = mdl.departments

That data can then be used to create the jittered scatter plot below, latticed by department, with 95% prediction intervals for each department.
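The interval construction and join can be sketched in pandas, assuming a normal approximation (mean ± 1.96 · sd per department); the toy rows and the `lower`/`upper`/`outlier` column names are illustrative, not the real model output:

```python
import pandas as pd

# Toy Chicago salaries
chicago = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "departments": ["Police", "Police", "Police", "Fire"],
    "salaries": [60000.0, 65000.0, 70000.0, 55000.0],
})

# Per-department 95% interval under a normal approximation
model_data = (chicago.groupby("departments")["salaries"]
              .agg(["mean", "std"]).reset_index())
model_data["lower"] = model_data["mean"] - 1.96 * model_data["std"]
model_data["upper"] = model_data["mean"] + 1.96 * model_data["std"]

# Equivalent of the left join in the SQL above
joined = chicago.merge(model_data, on="departments", how="left")
joined["outlier"] = (joined["salaries"] < joined["lower"]) | (
    joined["salaries"] > joined["upper"])
```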